Abstract - We introduce a simple and computationally
trivial method for binary classification based on the
evaluation of potential functions. We demonstrate that,
despite the conceptual and computational simplicity of
the method, its performance can match or exceed that of
standard Support Vector Machine methods.

Keywords: Machine Learning, Mi- 
croarray Data 

1 Introduction 

Binary classification is a fundamental focus 
in machine learning and informatics with 
many possible applications. For instance in 
biomedicine, the introduction of microarray 
and proteomics data has opened the door to 
connecting a molecular snapshot of an indi- 
vidual with the presence or absence of a dis- 
ease. However, microarray data sets can con- 
tain tens to hundreds of thousands of obser- 
vations and are well known to be noisy [2]. 
Despite this complexity, algorithms exist that 
are capable of producing very good performance
[10], [11]. Most notable among these
methods are the Support Vector Machine 
(SVM) methods. In this paper we introduce 
a simple and computationally trivial method 
for binary classification based on potential 
functions. This classifier, which we will call 
the potential method, is in a sense a general- 
ization of the nearest neighbor methods and 
is also related to radial basis function networks
(RBFN) [1], another method of current
interest in machine learning. Further, the 
method can be viewed as one possible non- 
linear version of Distance Weighted Discrim- 
ination (DWD), a recently proposed method 
whose linear version consists of choosing a de- 
cision plane by minimizing the sum of the in- 
verse distances to the plane [8]. 

Suppose that $\{y_i\}_{i=1}^{m}$ is a set of data of one
type, that we will call positive, and $\{z_j\}_{j=1}^{n}$ is a
data set of another type that we call negative.
Suppose that both sets of data are vectors in
$\mathbb{R}^N$. We will assume that $\mathbb{R}^N$ decomposes
into two sets $Y$ and $Z$ such that each $y_i \in Y$,
$z_j \in Z$, and any point in $Y$ should be classified
as positive and any point in $Z$ should be
classified as negative. Suppose that $x \in \mathbb{R}^N$
and we wish to predict whether $x$ belongs to
$Y$ or $Z$ using only information from the finite
sets of data $\{y_i\}$ and $\{z_j\}$. Given distance
functions $d_1(\cdot,\cdot)$ and $d_2(\cdot,\cdot)$ and positive
constants $\{a_i\}_{i=1}^{m}$, $\{b_j\}_{j=1}^{n}$, $\alpha$ and $\beta$,
we define a potential function:

$$I(x) = \sum_{i=1}^{m} \frac{a_i}{d_1(x, y_i)^{\alpha}} - \sum_{j=1}^{n} \frac{b_j}{d_2(x, z_j)^{\beta}}. \tag{1}$$

If $I(x) > 0$ then we say that $I$ classifies $x$
as belonging to $Y$, and if $I(x)$ is negative
then $x$ is classified as part of $Z$. The set
$I(x) = 0$ we call the decision surface. Under
optimal circumstances it should coincide
with the boundary between $Y$ and $Z$.
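The definition above translates almost directly into code. The following is a minimal sketch, not the authors' implementation: the function names, the default exponents and the default Euclidean distance are assumptions made for illustration, and the point x is assumed not to coincide with any training point, where the potential is singular.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]
Distance = Callable[[Vector, Vector], float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance, one possible choice for d_1 and d_2."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5


def potential(
    x: Vector,
    ys: Sequence[Vector],
    zs: Sequence[Vector],
    a: Optional[Sequence[float]] = None,
    b: Optional[Sequence[float]] = None,
    alpha: float = 2.0,  # illustrative default, not a value from the text
    beta: float = 2.0,   # illustrative default, not a value from the text
    d1: Distance = euclidean,
    d2: Distance = euclidean,
) -> float:
    """Sum of a_i / d1(x, y_i)^alpha minus sum of b_j / d2(x, z_j)^beta.

    Raises ZeroDivisionError if x coincides with a training point.
    """
    a = [1.0] * len(ys) if a is None else a
    b = [1.0] * len(zs) if b is None else b
    positive = sum(ai / d1(x, yi) ** alpha for ai, yi in zip(a, ys))
    negative = sum(bj / d2(x, zj) ** beta for bj, zj in zip(b, zs))
    return positive - negative


def classify(x: Vector, ys: Sequence[Vector], zs: Sequence[Vector], **kwargs) -> int:
    """Return +1 for Y, -1 for Z, and 0 if x lies on the decision surface."""
    value = potential(x, ys, zs, **kwargs)
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0
```

For example, with one positive point at the origin and one negative point at (1, 0), `classify((0.2, 0.0), [(0.0, 0.0)], [(1.0, 0.0)])` returns 1, and the midpoint (0.5, 0) lies on the decision surface.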

Provided that $d_1$ and $d_2$ are sufficiently
easy to evaluate, evaluating $I(x)$ is computationally
trivial. This fact could make it possible to use the
training data to search for optimal choices of
$\{a_i\}_{i=1}^{m}$, $\{b_j\}_{j=1}^{n}$, $\alpha$, $\beta$
and even the distance functions $d_i$. An obvious
choice for $d_1$ and $d_2$ is the Euclidean
distance. More generally, $d$ could be chosen
as the distance defined by the $\ell^p$ norm, i.e.
$d(x, y) = \|x - y\|_p$ where

$$\|x\|_p = \left( |x_1|^p + |x_2|^p + \dots + |x_N|^p \right)^{1/p}. \tag{2}$$

A more elaborate choice for a distance
$d$ might be the following. Let
$c = (c_1, c_2, \dots, c_N)$ be an $N$-vector and define $d_{c,p}$
to be the $c$-weighted distance:

$$d_{c,p}(x, y) = \left( c_1 |x_1 - y_1|^p + c_2 |x_2 - y_2|^p + \dots + c_N |x_N - y_N|^p \right)^{1/p}. \tag{3}$$
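As an illustration, the weighted distance can be written in a few lines. This is a sketch under the assumption that points and weights are plain sequences of floats; the function name is hypothetical.

```python
from typing import Sequence


def weighted_lp_distance(
    x: Sequence[float],
    y: Sequence[float],
    c: Sequence[float],
    p: float,
) -> float:
    """c-weighted l^p distance: (sum_k c_k * |x_k - y_k|^p)^(1/p)."""
    if p <= 0:
        raise ValueError("p must be positive")
    if not (len(x) == len(y) == len(c)):
        raise ValueError("x, y and c must have the same length")
    total = sum(ck * abs(xk - yk) ** p for ck, xk, yk in zip(c, x, y))
    return total ** (1.0 / p)
```

Setting every weight to 1 recovers the ordinary $\ell^p$ distance, and setting a weight to 0 removes that attribute from the comparison entirely.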

This distance allows assignment of different 
weights to the various attributes. Many 
methods for choosing c might be suggested 
and we propose a few here. Let $C$ be the vector
associated with the classification of the
data points, $C_i = \pm 1$ depending on the
classification of the $i$-th data point. The vector $c$
might consist of the absolute values of the univariate
correlation coefficients associated with
the $N$ variables with respect to $C$. This
would have the effect of emphasizing direc- 
tions which should be emphasized, but very 
well might also suppress directions which are 
important for multi-variable effects. Choos- 
ing c to be 1 minus the univariate p-values 
associated with each variable could be ex- 
pected to have a similar effect. Alterna- 
tively, c might be derived from some multi- 
dimensional statistical methods. In our ex- 
periments it turns out that 1 minus the p- 
values works quite well. 
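The first suggestion, weights given by absolute univariate correlations with the class vector, can be sketched without any statistical library. The function name is hypothetical, and assigning weight 0 to a constant feature is an assumption of this sketch; the choice reported to work well in the experiments, 1 minus the p-value, would additionally need a p-value routine and is not shown.

```python
import math
from typing import List, Sequence


def abs_correlation_weights(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> List[float]:
    """Return c_k = |Pearson correlation of feature k with the labels (+1 / -1)|."""
    n = len(samples)
    n_features = len(samples[0])
    mean_label = sum(labels) / n
    label_dev = [label - mean_label for label in labels]
    label_ss = sum(d * d for d in label_dev)

    weights = []
    for k in range(n_features):
        column = [row[k] for row in samples]
        mean_k = sum(column) / n
        dev = [value - mean_k for value in column]
        ss = sum(d * d for d in dev)
        if ss == 0.0 or label_ss == 0.0:
            # Assumption: a constant feature (or a single class) gets weight 0.
            weights.append(0.0)
            continue
        covariance = sum(dk * dl for dk, dl in zip(dev, label_dev))
        weights.append(abs(covariance) / math.sqrt(ss * label_ss))
    return weights
```

The returned list can be passed as the vector $c$ of a weighted distance, so that attributes strongly correlated with the class dominate the distance.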

Rather than $a_i = b_j = 1$ we might consider
other weightings of training points. We
would want to make the choice of
$a = (a_1, a_2, \dots, a_m)$ and $b = (b_1, b_2, \dots, b_n)$ based
on easily available information. An obvious
choice is the set of distances to other test
points. In the checkerboard experiment below
we demonstrate that training points too
close to the boundary between Y and Z have 
undue influence and cause irregularity in the 
decision curve. We would like to give less 
weight to these points by using the distance 
from the points to the boundary. However, 
since the boundary is not known, we use the 
distance to the closest point in the other set 
as an approximation. We show that this 
approach gives improvement in classification 
and in the smoothness of the decision surface. 
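The boundary-distance approximation described here, the distance from each training point to the closest point of the other set, can be sketched as follows. The function name and the default Euclidean distance are assumptions for illustration.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two points."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5


def nearest_opposite_distances(
    points: Sequence[Vector],
    others: Sequence[Vector],
    dist: Callable[[Vector, Vector], float] = euclidean,
) -> List[float]:
    """For each point, the distance to the closest point of the other set."""
    return [min(dist(pt, other) for other in others) for pt in points]
```

Calling `nearest_opposite_distances(ys, zs)` gives the weights for the positive points and `nearest_opposite_distances(zs, ys)` gives those for the negative points, so a training point sitting close to the other class contributes less to the potential.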

Note that if $p = 2$ in (2) our method converges
to the usual nearest neighbor method as
$\alpha = \beta \to \infty$, since for large $\alpha$ the term with
the smallest denominator will dominate the
sum. For finite $\alpha$ our method gives greater
weight to nearby points.
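A small numerical check makes the limit concrete. In this sketch (hypothetical function names, unit weights, Euclidean distance) three positive points together outweigh a single closer negative point when the exponent is small, while for a large exponent the closest point decides, as in the nearest neighbor rule.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def euclidean(u: Point, v: Point) -> float:
    """Euclidean distance in the plane."""
    return ((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2) ** 0.5


def potential_label(x: Point, ys: Sequence[Point], zs: Sequence[Point], alpha: float) -> int:
    """Sign of the potential with unit weights and equal exponents."""
    positive = sum(euclidean(x, y) ** -alpha for y in ys)
    negative = sum(euclidean(x, z) ** -alpha for z in zs)
    return 1 if positive > negative else -1


def nearest_neighbor_label(x: Point, ys: Sequence[Point], zs: Sequence[Point]) -> int:
    """Label of the single closest training point."""
    nearest_y = min(euclidean(x, y) for y in ys)
    nearest_z = min(euclidean(x, z) for z in zs)
    return 1 if nearest_y < nearest_z else -1


ys = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]  # three positive points
zs = [(1.0, 0.0)]                          # one negative point
x = (0.6, 0.0)                             # closest training point is negative

small_alpha_label = potential_label(x, ys, zs, alpha=1.0)
large_alpha_label = potential_label(x, ys, zs, alpha=20.0)
neighbor_label = nearest_neighbor_label(x, ys, zs)
```

Here `small_alpha_label` is 1 because the three positive points add up, whereas `large_alpha_label` and `neighbor_label` are both -1.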

In the following we report on tests of the
efficacy of the method using various $\ell^p$ norms
as the distance, various choices of $\alpha = \beta$ and
a few simple choices for $c$, $a$, and $b$.

2 A Simple Test Model 

We applied the method to the model problem
of a 4 by 4 checkerboard. In this test we
suppose that a square is partitioned into
16 equal subsquares and suppose that points
in alternate squares belong to two distinct
types. Following [7], we used 1000 randomly
selected points as the training set and 40,000
grid points as the test set. We choose to define
both the distance functions by the usual
$\ell^p$ norm. We will also require $\alpha = \beta$ and
$a_i = b_j = 1$. Thus we used as the potential
function:

$$I(x) = \sum_{i=1}^{m} \frac{1}{\|x - y_i\|_p^{\alpha}} - \sum_{j=1}^{n} \frac{1}{\|x - z_j\|_p^{\alpha}}. \tag{4}$$
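The checkerboard test can be sketched in pure Python as below. The board coordinates $[0, 4) \times [0, 4)$, the use of cell centers as grid points, the random seed and the function names are all assumptions of this sketch; the defaults for $\alpha$ and $p$ are the values the text reports near the maximum.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def checkerboard_label(pt: Point) -> int:
    """+1 or -1 on alternating unit squares of the board [0, 4) x [0, 4)."""
    return 1 if (math.floor(pt[0]) + math.floor(pt[1])) % 2 == 0 else -1


def potential(x: Point, ys: List[Point], zs: List[Point], alpha: float, p: float) -> float:
    """Unweighted potential with the l^p norm and equal exponents."""
    def inverse_power(u: Point, v: Point) -> float:
        # ||u - v||_p ** alpha equals (sum_k |u_k - v_k|^p) ** (alpha / p).
        # A test point equal to a training point would raise ZeroDivisionError.
        s = abs(u[0] - v[0]) ** p + abs(u[1] - v[1]) ** p
        return s ** (-alpha / p)

    return sum(inverse_power(x, y) for y in ys) - sum(inverse_power(x, z) for z in zs)


def checkerboard_accuracy(
    n_train: int = 1000,
    grid_side: int = 200,  # 200 x 200 = 40,000 grid points
    alpha: float = 4.5,
    p: float = 1.5,
    seed: int = 0,
) -> float:
    """Fraction of grid points whose predicted type matches the checkerboard."""
    rng = random.Random(seed)
    train = [(rng.uniform(0.0, 4.0), rng.uniform(0.0, 4.0)) for _ in range(n_train)]
    ys = [t for t in train if checkerboard_label(t) == 1]
    zs = [t for t in train if checkerboard_label(t) == -1]

    step = 4.0 / grid_side
    correct = 0
    for i in range(grid_side):
        for j in range(grid_side):
            x = ((i + 0.5) * step, (j + 0.5) * step)  # center of grid cell (i, j)
            predicted = 1 if potential(x, ys, zs, alpha, p) > 0 else -1
            correct += predicted == checkerboard_label(x)
    return correct / grid_side ** 2
```

The full-size defaults require 40 million distance evaluations and are slow in pure Python; a call such as `checkerboard_accuracy(n_train=300, grid_side=20)` runs quickly and is enough to explore how the accuracy varies with `alpha` and `p`.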

Using different values of $\alpha$ and $p$ we found the
percentage of test points that are correctly
classified by $I$. We repeated this experiment
on 50 different training sets and tabulated the
percentage of correct classifications as a function
of $\alpha$ and $p$. The results are displayed in
Figure 1. We find that the maximum occurs
at approximately $p = 1.5$ and $\alpha = 4.5$.

Figure 1: The percentage of correct classifications
for the 4 by 4 checkerboard test problem as
a function of the parameters $\alpha$ and $p$. The maximum
occurs near $p = 1.5$ and $\alpha = 4.5$. Notice
that the graph is fairly flat near the maximum.

The relative flatness near the maximum in
Figure 1 indicates robustness of the method
with respect to these parameters. We further
observed that changing the training set
affects the location of the maximum only
slightly and the effect on the percentage
correct is small.

Finally, we tried classification of the 4 by
4 checkerboard using the minimal distance to
data of the opposite type in the coefficients
for the training data, i.e. $a$ and $b$ in:

$$I(x) = \sum_{i=1}^{m} \frac{a_i}{\|x - y_i\|_p^{\alpha}} - \sum_{j=1}^{n} \frac{b_j}{\|x - z_j\|_p^{\alpha}}.$$

With this we obtained 96.2% accuracy in the
classification and a noticeably smoother decision
surface (see Figure 2(b)). The optimized
parameters for our method were $p \approx 3.5$ and
$\alpha \approx 3.5$. In this optimization we also used
the distance to opposite type to a power $\beta$
and the optimal value for $\beta$ was about 3.5. In
[7] a SVM method obtained 97% correct classification,
but only after 100,000 iterations.

Figure 2: (a) The classification of the 4 by
4 checkerboard without distance to boundary
weights. In this test 95% were correctly classified.
(b) The classification using distance to
boundary weights. Here 96.2% were correctly
classified.



3 Clinical Data Sets 

Next we applied the method to microarray
data from two cancer study sets,
Prostate_tumor and DLBCL [10], [11]. Based
on our experience in the previous problem,
we used the potential function:



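[Equation (5): garbled in the source. The recoverable fragments are two sums, one over each set of training points, and a factor $(1 - \epsilon)$.]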

where $d_{c,p}$ is the metric defined in (3). The
vector $c$ was taken to be 1 minus the univariate
p-value for each variable with respect
to the classification. The weights $a_i$, $b_i$ were
taken to be the distance from each data point
to the nearest data point of the opposite type.
Using the potential (5) we obtained leave-one-out
cross validation (LOOCV) results for various
values of $p$, $\alpha$, $\beta$, and $\epsilon$. For these data
sets LOOCV has been shown to be a valid
methodology [10].
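Leave-one-out cross validation itself is independent of the particular potential. The sketch below uses an unweighted Euclidean potential as a stand-in for potential (5); the function names, the stand-in classifier and its default exponent are assumptions, and duplicate data points are assumed absent since they would make the potential singular.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Classifier = Callable[[Vector, List[Vector], List[Vector]], int]


def euclidean_potential_classifier(
    x: Vector, ys: List[Vector], zs: List[Vector], alpha: float = 4.0
) -> int:
    """Stand-in classifier: sign of the unweighted Euclidean potential."""
    def inverse_power(u: Vector, v: Vector) -> float:
        squared = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
        return squared ** (-alpha / 2.0)

    value = sum(inverse_power(x, y) for y in ys) - sum(inverse_power(x, z) for z in zs)
    return 1 if value > 0 else -1


def loocv_accuracy(
    samples: Sequence[Vector],
    labels: Sequence[int],
    classify: Classifier = euclidean_potential_classifier,
) -> float:
    """Hold out each sample in turn, classify it from the rest, return the hit rate."""
    correct = 0
    for held_out in range(len(samples)):
        ys = [s for i, (s, l) in enumerate(zip(samples, labels)) if i != held_out and l == 1]
        zs = [s for i, (s, l) in enumerate(zip(samples, labels)) if i != held_out and l == -1]
        correct += classify(samples[held_out], ys, zs) == labels[held_out]
    return correct / len(samples)
```

Any classifier with the same call signature can be passed as `classify`, which makes it straightforward to scan a grid of parameter values and record the LOOCV accuracy for each.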

On the DLBCL data the nearly optimal performance
of 98.7% was achieved for many
parameter combinations. The SVM methods
studied in [10], [11] achieved 97.5% correct
on this data while the k-nearest neighbor
method correctly classified only 87%. Specifically,
we found that for each $1.6 < p < 2.4$
there were robust sets of parameter combinations
that produced performance better than
SVM. These parameter sets were contained
generally in the intervals $10 < \alpha < 15$,
$10 < \beta < 15$ and $0 < \epsilon < 0.5$.

For the DLBCL data, when we used the $\ell^p$
norm instead of the weighted distances and
also dropped the data weights ($\epsilon = \beta = 0$),
the best performance sank to 94.8% correct
classification at $(p, \alpha) = (2, 6)$. This illustrates
the importance of these parameters.

For the Prostate_tumor data set the results
using potential (5) were not quite as
good. The best performance, 89.2% correct,
occurred for $1.2 < p < 1.6$ with $\alpha \in [11.5, 15]$,
$\beta \in [12, 14]$, $\epsilon \in [0.1, 0.175]$. In [10], [11] various
SVM methods were shown to achieve 92%
correct and the k-nearest neighbor method
achieved 85% correct. With feature selection
we were able to obtain much better results on
the Prostate_tumor data set. In particular,
we used the univariate p-values to select the
most relevant features. The optimal performance
occurred with 20 features. In this test
we obtained 96.1% accuracy for a robust set of
parameter values.
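Univariate feature selection of this kind can be sketched as a ranking step. The text does not say which test produced the p-values; this sketch assumes a pooled two-sample t-test, for which the degrees of freedom are the same for every feature, so ranking by decreasing |t| gives the same order as ranking by increasing two-sided p-value. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def pooled_t_statistics(
    pos: Sequence[Sequence[float]],
    neg: Sequence[Sequence[float]],
) -> List[float]:
    """Absolute pooled two-sample t statistic for each feature (needs m + n > 2)."""
    m, n = len(pos), len(neg)
    scores = []
    for k in range(len(pos[0])):
        yk = [row[k] for row in pos]
        zk = [row[k] for row in neg]
        mean_y = sum(yk) / m
        mean_z = sum(zk) / n
        ss = sum((v - mean_y) ** 2 for v in yk) + sum((v - mean_z) ** 2 for v in zk)
        pooled_var = ss / (m + n - 2)
        if pooled_var == 0.0:
            # Assumption: no within-class spread means perfect or no separation.
            scores.append(0.0 if mean_y == mean_z else math.inf)
            continue
        t = (mean_y - mean_z) / math.sqrt(pooled_var * (1.0 / m + 1.0 / n))
        scores.append(abs(t))
    return scores


def select_features(
    pos: Sequence[Sequence[float]],
    neg: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Indices of the k features with the largest |t|, in increasing index order."""
    scores = pooled_t_statistics(pos, neg)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

The selected indices can then be used to restrict every data vector to the chosen attributes before the potential is evaluated.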



data set    kNN    SVM      Pot      Pot-FS
DLBCL       87%    97.5%    98.7%    -
Prostate    85%    92%      89.2%    96.1%


Table 1: Results from the potential method on 
benchmark DLBCL and Prostate_tumor micro- 
array data sets compared with the SVM methods 
and the k-nearest neighbor method. The last col- 
umn is the performance of the potential method 
with univariate feature selection. 

4 Conclusions 

The results demonstrate that, despite its sim- 
plicity, the potential method can be as ef- 
fective as the SVM methods. Further work 
needs to be done to realize the maximal per- 
formance of the method. It is important that 
most of the calculations required by the po- 
tential method are mutually independent and 
so are highly parallelizable. 

We point out an important difference be- 
tween the potential method and Radial Basis 
Function Networks. RBFNs were originally 
designed to approximate a real-valued function
on $\mathbb{R}^N$. In classification problems, the
RBFN attempts to approximate the characteristic
functions of the sets $Y$ and $Z$ (see
[4]). A key point of our method is to approximate
the decision surface only. The potential
method is designed for classification problems,
whereas RBFNs have many other applications
in machine learning.

We also note that the potential method,
by putting singularities at the known data
points, always classifies some neighborhood
of a data point as being in the class of
that point. This feature makes the potential
method less suitable when the decision
surface is in fact not a surface, but a "fuzzy"
boundary region.

There are several avenues of investigation 
that seem to be worth pursuing. Among 
these, we have further investigated the role of 
the distance to the boundary with success [1]. 
Another direction of interest would be to ex- 
plore alternative choices for the weightings c, 
$a$ and $b$. Another would be to investigate the
use of more general metrics by searching for
optimal choices in a suitable function space
[9]. Implementation of feature selection with
the potential method is also likely to be fruitful.
Feature selection routines already exist
in the context of k-nearest neighbor methods
[6] and those can be expected to work equally
well for the potential method. Feature selection
is recognized to be very important in
micro-array analysis, and we view the success
of the method without feature selection
and with primitive feature selection as a good
sign.